# Load packages & import data
library(readr)
library(ggplot2)
library(dplyr)
bikes <- read_csv("https://mac-stat.github.io/data/bikeshare.csv") %>%
rename(rides = riders_registered)Multiple regression principles (Notes)
STAT 155
Notes
Learning goals
Working with multiple predictors in our plots and models can get complicated!
There are no recipes for this process.
BUT there are some guiding principles that assist in long-term retention, deeper understanding, and the ability to generalize our tools in new settings.
By the end of this lesson, you should be familiar with some general principles for…
- incorporating additional quantitative or categorical predictors in a visualization
- how additional quantitative or categorical predictors impact the physical representation of a model
- interpreting quantitative or categorical coefficients in a multiple regression model
Readings and videos
Please watch the following video before class.
Exercises
Let’s revisit the bikeshare data:
Our goal is to understand how / why registered ridership from day to day.
To this end, we’ll build various multiple linear regression models of rides by different combinations of the possible predictors.
# Check out the data
head(bikes)
## # A tibble: 6 × 15
## date season year month day_of_week weekend holiday temp_actual
## <date> <chr> <dbl> <chr> <chr> <lgl> <chr> <dbl>
## 1 2011-01-01 winter 2011 Jan Sat TRUE no 57.4
## 2 2011-01-02 winter 2011 Jan Sun TRUE no 58.8
## 3 2011-01-03 winter 2011 Jan Mon FALSE no 46.5
## 4 2011-01-04 winter 2011 Jan Tue FALSE no 46.8
## 5 2011-01-05 winter 2011 Jan Wed FALSE no 48.7
## 6 2011-01-06 winter 2011 Jan Thu FALSE no 47.1
## # ℹ 7 more variables: temp_feel <dbl>, humidity <dbl>, windspeed <dbl>,
## # weather_cat <chr>, riders_casual <dbl>, rides <dbl>, riders_total <dbl>Exercise 1: Review visualization
Let’s build a model of rides by windspeed (quantitative) and weekend status (categorical).
Write a model statement for this regression model.
Plot & describe, in words, the relationship between these 3 variables.
# Plot of rides vs windspeed & weekend
# HINT: Start with a plot of rides vs windspeed, then add an aesthetic for weekend!Exercise 2: Review model
Let’s build the model. Run the following code:
bike_model_1 <- lm(rides ~ windspeed + weekend, data = bikes)
coef(summary(bike_model_1))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4738.38053 147.53653 32.116659 1.208405e-141
## windspeed -63.97072 10.45274 -6.119997 1.528443e-09
## weekendTRUE -925.15701 119.86330 -7.718434 3.891082e-14The model formula with our coefficient estimates filled in is therefore:
E[rides | windspeed, weekendTRUE] = 4738.38 - 63.97 * windspeed - 925.16 * weekendTRUE
This model formula is represented by 2 lines, one corresponding to weekends and the other to weekdays. Simplify the model formula above for weekdays and weekends:
weekdays: rides = ___ - ___ windspeed
weekends: rides = ___ - ___ windspeed
Exercise 3: Review coefficient interpretation
The intercept coefficient, 4738.38, represents the intercept of the sub-model for weekdays, the reference category. What’s its contextual interpretation?
The
windspeedcoefficient, -63.97, represents the shared slope of the weekend and weekday sub-models. What’s its contextual interpretation?The
weekendTRUEcoefficient, -925.16, represents the change in intercept for the weekend vs weekday sub-model. What’s its contextual interpretation?
Exercise 4: 2 categorical predictors – visualization
Thus far, we’ve explored a couple examples of multiple regression models that have 2 predictors, 1 quantitative and 1 categorical.
So what happens when both predictors are categorical?!
To this end, let’s model rides by weekend status and season.
The below code plots rides vs season.
Modify this code to also include information about weekend.
HINT: Remember the visualization principle that additional categorical predictors require some sort of grouping mechanism / mechanism that distinguishes between the 2 groups.
# rides vs season
bikes %>%
ggplot(aes(y = rides, x = season)) +
geom_boxplot()
# rides vs season AND weekend
bikes %>%
ggplot(aes(y = rides, x = season, ___ = ___)) +
geom_boxplot()
## Error in parse(text = input): <text>:8:38: unexpected input
## 7: bikes %>%
## 8: ggplot(aes(y = rides, x = season, __
## ^Exercise 5: follow-up
Describe (in words) the relationship of ridership with season & weekend status.
A model of
ridesbyseasonalone would be represented by only 4 expected outcomes, 1 for each season. Considering this and the plot above, how do you anticipate a model ofridesbyseasonandweekendstatus will be represented?- 2 lines, 1 for each weekend status
- 8 lines, 1 for each possible combination of season & weekend
- 2 expected outcomes, 1 for each weekend status
- 8 expected outcomes, 1 for each possible combination of season & weekend
Exercise 6: 2 categorical predictors – build the model
Let’s build the multiple regression model of rides vs season and weekend:
bike_model_2 <- lm(rides ~ weekend + season, bikes)
coef(summary(bike_model_2))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4260.4492 99.16363 42.9638294 1.384994e-201
## weekendTRUE -912.3324 103.23016 -8.8378473 7.298199e-18
## seasonspring -116.3824 132.76018 -0.8766364 3.809741e-01
## seasonsummer 438.4424 132.06413 3.3199205 9.454177e-04
## seasonwinter -1719.0572 133.30505 -12.8956646 2.081758e-34Thus the model formula with coefficient estimates filled in is given by:
E[rides | weekend, season] = 4260.45 - 912.33 weekendTRUE - 116.38 seasonspring + 438.44 seasonsummer - 1719.06 seasonwinter
- Use this model to predict the ridership on the following days:
# a fall weekday
4260.45 - 912.33*___ - 116.38*___ + 438.44*___ - 1719.06*___
# a winter weekday
4260.45 - 912.33*___ - 116.38*___ + 438.44*___ - 1719.06*___
# a fall weekend day
4260.45 - 912.33*___ - 116.38*___ + 438.44*___ - 1719.06*___
# a winter weekend day
4260.45 - 912.33*___ - 116.38*___ + 438.44*___ - 1719.06*___
## Error in parse(text = input): <text>:2:19: unexpected input
## 1: # a fall weekday
## 2: 4260.45 - 912.33*__
## ^- We only made 4 predictions here. How many possible predictions does this model produce? Is this consistent with your intuition in the previous exercise?
Exercise 7: 2 categorical predictors – interpret the model
Use your above predictions and visualization to fill in the below interpretations of the model coefficients.
Hint: What is the consequence of plugging in 0 or 1 for the different weekend and season categories?
Interpreting 4260: On average, we expect there to be 4260 riders on (weekdays/weekends) during the (fall/spring/summer/winter).
Interpreting -912: On average, in any season, we expect there to be 912 (more/fewer) riders on weekends than on ___.
An alternative interpretation: On average, we expect there to be 912 (more/fewer) riders on weekends than on ___, adjusting for season.
- Interpreting -1719: On average, on both weekdays and weekends, we expect there to be 1719 (more/fewer) riders in winter than in ___.
An alternative interpretation: On average, we expect there to be 1719 (more/fewer) riders in winter than in ___, controlling for weekday status.
Exercise 8: 2 quantitative predictors – visualization
Next, consider the relationship between rides and 2 quantitative predictors: windspeed and temp_feel. Check out the plot of this relationship below.
This reflect the visualization principle that quantitative variables require some sort of numerical scaling mechanism – rides and windspeed get numerical axes, and temp_feel gets a color scale.
Modify the code below to recreate this plot.
bikes %>%
ggplot(aes(y = rides, x = windspeed, ___ = ___)) +
geom_point()
## Error in parse(text = input): <text>:2:41: unexpected input
## 1: bikes %>%
## 2: ggplot(aes(y = rides, x = windspeed, __
## ^Exercise 9: follow-up
Describe (in words) the relationship of ridership with windspeed & temperature.
Exercise 10: 2 quantitative predictors – modeling
Let’s build the multiple regression model of rides vs windspeed and temp_feel:
bike_model_3 <- lm(rides ~ windspeed + temp_feel, data = bikes)
coef(summary(bike_model_3))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -24.06464 299.303032 -0.08040225 9.359394e-01
## windspeed -36.54372 9.408116 -3.88427585 1.119805e-04
## temp_feel 55.51648 3.330739 16.66791759 4.436963e-53Thus the model formula with coefficient estimates filled in is given by,
E[rides | windspeed, temp_feel] = -24.06 - 36.54 windspeed + 55.52 temp_feel
Interpret the intercept coefficient, -24.06, in context.
Interpret the
windspeedcoefficient, -36.54, in context.Interpret the
temp_feelcoefficient, 55.52, in context.
Exercise 11: Which is “best”?
We’ve now observed 3 different models of ridership, each having 2 predictors. The R-squared values of these models, along with those of the simple linear regression models with each predictor alone, are summarized below.
| model | predictors | R-squared |
|---|---|---|
bike_model_1 |
windspeed & weekend |
0.119 |
bike_model_2 |
weekend & season |
0.349 |
bike_model_3 |
windspeed & temp_feel |
0.310 |
bike_model_4 |
windspeed |
0.047 |
bike_model_5 |
temp_feel |
0.296 |
bike_model_6 |
weekend |
0.074 |
bike_model_7 |
season |
0.279 |
Which model does the best job of explaining the variability in ridership from day to day?
If you could only pick one predictor, which would it be?
What happens to R-squared when we add a second predictor to our model, and why does this make sense? For example, how does the R-squared for model 1 (with both windspeed and weekend) compare to those of model 4 (only windspeed) and model 6 (only weekend)?
Are 2 predictors always better than 1? Provide evidence and explain why this makes sense.
Exercise 12: Principles of interpretation
These exercises have revealed some principles behind interpreting model coefficients, summarized below.
Review and confirm that these make sense.
Principles of interpretation
Consider a multiple linear regression model:
\[E[Y | X_1, X_2, ..., X_p] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p\]
We can interpret the coefficients as follows:
\(\beta_0\) (“beta 0”) is the y-intercept. It describes the average value of \(Y\) when \(X_1, X_2,..., X_k\) are all 0, ie. when all quantitative predictors are set to 0 and the categorical predictors are set to their reference levels.
\(\beta_i\) (“beta i”) is the coefficient of \(X_i\).
If \(X_i\) is quantitative, \(\beta_i\) describes the average change in \(Y\) associated with a 1-unit increase in \(X_i\) while at a fixed set of the other \(X\).
If \(X_i\) represents a category of a categorical variable, \(\beta_i\) describes the average difference in \(Y\) between this category and the reference category, while at a fixed set of the other \(X\).
Extra practice
The following exercises provide extra practice. If you don’t get to these during class, you’re encouraged to try them outside class.
Exercise 13: Practice 1
Consider the relationship of rides vs weekend and weather_cat.
- Construct a visualization of this relationship.
- Construct a model of this relationship.
- Interpret the first 3 model coefficients.
Exercise 14: Practice 2
Consider the relationship of rides vs temp_feel and humidity.
- Construct a visualization of this relationship.
- Construct a model of this relationship.
- Interpret the first 3 model coefficients.
Exercise 15: Practice 3
Consider the relationship of rides vs temp_feel and weather_cat.
- Construct a visualization of this relationship.
- Construct a model of this relationship.
- Interpret the first 3 model coefficients.
Exercise 16: CHALLENGE
We’ve explored models with 2 predictors. What about 3 predictors?! Consider the relationship of rides vs temp_feel, humidity, AND weekend.
- Construct a visualization of this relationship.
- Construct a model of this relationship.
- Interpret each model coefficient.
Done!
- Finalize your notes: (1) Render your notes to an HTML file; (2) Inspect this HTML in your Viewer – check that your work translated correctly; and (3) Outside RStudio, navigate to your ‘Activities’ subfolder within your ‘STAT155’ folder and locate the HTML file – you can open it again in your browser.
- Clean up your RStudio session: End the rendering process by clicking the ‘Stop’ button in the ‘Background Jobs’ pane.
- Check the solutions in the course website, at the bottom of the corresponding chapter.
- Work on homework!